Language Models for Image Captioning: The Quirks and What Works

Author: Cheng Hao
Deng Li
Devlin Jacob
Fang Hao
Gupta Saurabh
He Xiaodong
Mitchell Margaret
Zweig Geoffrey
Publication date: 14/10/2015

Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.
Comment: See http://research.microsoft.com/en-us/projects/image_captioning for project information

arXiv.org e-Print Archive

CiteSeerX

LSTM Time and Frequency Recurrence for Automatic Speech Recognition

Author: Abdelrahman Mohamed
Geoffrey Zweig
Jinyu Li
Yifan Gong
Publication date: 01/05/2020

Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward neural networks (DNNs). A key aspect of these models is the use of time recurrence, combined with a gating architecture that ameliorates the vanishing gradient problem. Inspired by human spectrogram reading, in this paper we propose an extension to LSTMs that performs the recurrence in frequency as well as in time. This model first scans the frequency bands to generate a summary of the spectral information, and then uses the output layer activations as the input to a traditional time LSTM (T-LSTM). Evaluated on a Microsoft short message dictation task, the proposed model obtained a 3.6% relative word error rate reduction over the T-LSTM.
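
The frequency-then-time recurrence described in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract only states that one LSTM scans the frequency bands and that its outputs feed a time LSTM, so everything else here is an assumption made for illustration: the overlapping band windows (`band_width`, `band_stride`), the concatenation of per-band outputs into one summary vector, the plain LSTM cell without peepholes or projection, the random untrained weights, and all function names and dimensions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(input_dim, hidden_dim, rng):
    """Random, untrained weights for a plain LSTM cell (hypothetical helper)."""
    rows = 4 * hidden_dim  # input, forget, candidate, and output gates
    cols = input_dim + hidden_dim
    return {
        "hidden_dim": hidden_dim,
        "W": [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)],
        "b": [0.0] * rows,
    }

def lstm_step(cell, x, h, c):
    """One LSTM update from input x and previous state (h, c)."""
    n = cell["hidden_dim"]
    xh = list(x) + h
    z = [sum(w * v for w, v in zip(row, xh)) + b
         for row, b in zip(cell["W"], cell["b"])]
    i, f, g, o = z[:n], z[n:2 * n], z[2 * n:3 * n], z[3 * n:]
    c = [sigmoid(f[k]) * c[k] + sigmoid(i[k]) * math.tanh(g[k]) for k in range(n)]
    h = [sigmoid(o[k]) * math.tanh(c[k]) for k in range(n)]
    return h, c

def run_lstm(cell, inputs):
    """Run the cell over a sequence of vectors and return every hidden state."""
    n = cell["hidden_dim"]
    h, c = [0.0] * n, [0.0] * n
    outputs = []
    for x in inputs:
        h, c = lstm_step(cell, x, h, c)
        outputs.append(h)
    return outputs

def frequency_then_time_lstm(features, f_cell, t_cell, band_width, band_stride):
    """features: list of frames, each a list of spectral values."""
    summaries = []
    for frame in features:
        # Assumed banding: slide a window along the frequency axis.
        bands = [frame[s:s + band_width]
                 for s in range(0, len(frame) - band_width + 1, band_stride)]
        # Recurrence over frequency: scan the bands of a single frame.
        f_out = run_lstm(f_cell, bands)
        # Assumed summary: concatenate the per-band outputs into one vector.
        summaries.append([v for h in f_out for v in h])
    # Recurrence over time: the time LSTM consumes the per-frame summaries.
    return run_lstm(t_cell, summaries)

rng = random.Random(0)
features = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(5)]
f_cell = make_lstm(input_dim=8, hidden_dim=6, rng=rng)
t_cell = make_lstm(input_dim=9 * 6, hidden_dim=10, rng=rng)  # 9 bands x 6 units
out = frequency_then_time_lstm(features, f_cell, t_cell, band_width=8, band_stride=4)
print(len(out), len(out[0]))  # 5 10
```

With 40 values per frame, a width of 8, and a stride of 4, each frame yields 9 bands, so the time LSTM in this sketch receives a 54-dimensional summary per frame and returns one 10-dimensional state for each of the 5 frames.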

CiteSeerX